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Abstract 

The proper biological functioning of proteins often relies on the occurrence of coordinated fluctuations 
around their native structure, or of wider and sometimes highly elaborated motions. Coarse-grained elastic- 
network descriptions are known to capture essential aspects of conformational dynamics in proteins, but have 
so far remained mostly phenomenological, and unable to account for the chemical specificities of amino acids. 
Here, we propose a method to derive residue- and distance-specific effective harmonic potentials from the 
statistical analysis of an extensive dataset of NMR conformational ensembles. These potentials constitute 
dynamical counterparts to the mean-force statistical potentials commonly used for static analyses of protein 
structures. In the context of the elastic network model, they yield a strongly improved description of the 
cooperative aspects of residue motions, and give the opportunity to systematically explore the influence of 
sequence details on protein dynamics. 



Introduction 

Deciphering the motions that underlie many aspects of 
protein function is a major current challenge in molecu- 
lar biology, with the potential to generate numerous ap- 
plications in biomedical research and biotechnology. Al- 
though molecular dynamics (MD) hold a prominent po- 
sition among computational approaches, considerable ef- 
forts have been devoted to the development of coarse- 
grained models of protein dynamics T] . Besides their abil- 
ity to follow motions on time scales that are usually not 
accessible to MD simulations, these models also give the 
possibility to better understand the general principles that 
rule the dynamical properties of proteins. 13 A. 

The elegant simplicity of the elastic network models 
(ENM) certainly contributed to their popularity, and they 
have been successfully exploited in a wide range of ap- 



plications [2j[3j. In these models, the residues are usu- 
ally represented as single particles and connected to their 
neighbors by Hookean springs |4J|5]. The input structure 
is assumed to be the equilibrium state, i.e. the global 
energy minimum of the system. Common variants in- 
clude the homogeneous ENM, in which springs of equal 
stiffness connect pairs of residues separated by a distance 
smaller than a predefined cutoff, and other versions in 
which the spring stiffness decays as the interresidue dis- 
tance increases [bHs]. In all cases, the equations of motion 
can be either linearized around equilibrium, to perform a 
normal mode analysis of the system [9-11 , or integrated 
to obtain time-resolved relaxation trajectories |12[|13|. 

Despite their many achievements, purely structural 
ENM also come with severe limitations. Notably, model- 
ing the possible effects of mutations within this framework 
usually requires random local perturbations of the spring 
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constants [m], o r a more drastic removal of links from 
the network |15|. A few attempts have been made to in- 
clude sequence-specificity in the ENM by setting the spring 
constants proportional to the depth of the energy min- 
ima, as estimated by statistical contact potentials [T6{[l7| . 
However, this approach cannot be extended to distance- 
dependent potentials, for they are not consistent with the 
ground hypothesis of the ENM, i.e. that all pairwise in- 
teraction potentials are at their minimum in the native 
structure. Other studies have led to the conclusion that 
the ENM behave as entropic models dominated by struc- 
tural features, and that the level of coarse-graining is prob- 
ably too high to incorporate sequence details [5 18 . Still, 
the chemical nature of residues at key positions can have 
significant effects on the main dynamical properties of a 
protein. Hinge motions 19 , for instance, obviously re- 



quire some architectural conditions to be fulfilled, such as 
the presence of two domains capable of moving relatively 
independently. But the amplitude and preferred direction 
of the motion are most likely determined by fine tuning 
of specific interactions in the hinge region. In proteins 
subject to domain swapping, the hinge loops have indeed 
been shown to frequently include residues that are not op- 



timal for stability 20 . The importance of the amino acid 



sequence has also been repeatedly emphasized by experi- 
mental studies of the impact of mutations on the confor- 
mational dynamics of proteins 



21 23 



A major obstacle to the definition of accurate coarse- 
grained descriptions of protein dynamics lies in the highly 
cooperative nature of protein motions, which makes it dif- 
ficult to identify the properties of the individual building 
blocks independently of the overall architecture of each 
fold. By condensing the information contained in a mul- 
titude of NMR ensembles, we build here a mean protein 
environment, in which the behavior of residue pairs can 
be tracked independently of each protein's specific struc- 
ture. This methodology brings an efficient way of assessing 
coarse-grained models of protein dynamics and of deriving 
effective energy functions adapted to these models. In the 
context of the ENM, we identify a set of spring constants 
that depend on both the interresidue distances and the 
chemical nature of amino acids, and that markedly im- 
prove the performances of the model. 

Results 

Dynamical properties of proteins from the per- 
spective of an average pair of residues 

The mean-square fluctuations of individual residues 
(MSRF) have been extensively relied on to characterize 
protein flexibility and to evaluate coarse-grained models of 
protein dynamics f24' , in part because of their widespread 
availability as crystallographic B-factors. However, since 
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Figure 1: Schematic illustration of the apparent stiffness 7. A 
simple model containing 8 beads connected by elastic springs 
was subjected to 10^ integration steps under Gaussian noise. 
Selected values of < (ARi)^ >, 7 and 7° are given in arbitrary 
units. Individually, the pairs A-B and C-D would be identi- 
cal, but they experience differently the influence of the other 
beads. As a result, the C-D pair is effectively more rigid than 
A-B (7AS < yen)- In both cases, the motions are somewhat 
correlated, as the apparent stiffness 7 is larger than what is 
expected from the knowledge of their individual motions (7°). 
Beads A and E do not interact directly but the effect of the 
network on their relative motions is captured by the values of 
7AE and 7^^. 



the MSRF carry little information about the cooperative 
nature of residue motions, we propose to examine the 
dynamical behavior of proteins from the perspective of 
residue pairs rather than individual residues. Informa- 
tion about the fluctuations of interresidue distances is con- 
tained in the data of NMR experiments for numerous pro- 
teins, and will be exploited here. We define the apparent 
stiffness of a pair of residues i,j in a protein p: 



7pi] = '^ksT/al 



(1) 



where ks is the Boltzmann constant, T the temperature, 
and tJ^ the variance of the distance r between residues i 
and j , in a structural ensemble representative of the equi- 
librium state, "fpij is defined up to a multiplicative factor, 
which corresponds to the temperature. We also introduce 
the uncorrelated apparent stiffness 7°^,, to quantify the 
impact of the individual fluctuations of residues i and j 
on the fluctuations of the distance that separates them. 
This is achieved by using a° ._ instead of ar^^j in eq. 1, 
where a° is computed after exclusion of all correlations 
between the motions of residues i and j (Methods) . 

As illustrated in Figure 1, 7 can be quite different from 
one residue pair to another. Indeed, besides the impact 
of direct interactions, 7 is also strongly dependent on the 
overall fold of the protein, and on the position of the pair 
within the structure. To remove the speciflc influence of 
each protein's architecture, we deflne the apparent stiffness 
in a mean protein environment 7(5, d): 



7(s,d) 



=r2(s,d)' 



with cr^(s, d) 



Z^p=l Z^ 



Np{s.d) 



Mpul 



Ep=i^p(s,rf)Mp 



(2) 

where s is one of 210 amino acid pairs, d the discretized 
equihbrium distance between pairs of residues {d < rpij < 
fi + 0.5A), Mp the number of structures in the equihbrium 
ensemble of protein p, and Np{s,d) the number of {s,d) 
residue pairs in protein p. Pairs of consecutive residues 
were dismissed, so as to consider only non-bonded inter- 
actions. The mean protein environment is thus obtained 
by averaging over a large number of residue pairs in a 
dataset of P = 1500 different proteins {Methods). 

The influence of the distance separating two residues 
on the cooperativity of their motions can be investigated 
by considering amino acid types indistinctively in eq. 2. 
Interestingly, j{d) follows approximately a power law, with 
an exponent of about -2.5 (Fig. 2). Finer details include a 
first maximal value occurring for Cq-Cq. distances between 
5 and 5.5 A, i.e. the separation between hydrogen-bonded 
residues within regular secondary structure elements, and 
a second around 9 A, which corresponds to indirect, second 
neighbor, interactions. The high level of cooperativity in 
residue motions is well illustrated by the comparison of 
7((i) and its uncorrelated counterpart 7° (d) . Indeed, these 
two functions would take identical values if the variability 
of the distance between two residues could be explained 
solely by the extent of their individual fluctuations. In 
a mean protein environment, however, 7(d) is about two 
orders of magnitude larger than 7° (d) at short-range, and 
the difference remains quite important up to about 30-40 

A. 

The comparison of ^{d) values extracted from subsets 
containing exclusively small, large, all-a, or all-/3 proteins 
indicates that the content of the dataset has a remarkably 
limited impact on 7(d) (Supplementary Fig. 1). This dis- 
tance dependence can thus be seen as a general property 
of protein structures, a signature of protein cooperativity 
at the residue pair level. Of course, since 7(d) is repre- 
sentative of a mean protein environment, deviations may 
occur for individual proteins, according to their specific 
structural organizations (Supplementary Fig. 2). 

The apparent stiffness 7(5) is computed for each type of 
amino acid pair s using eq. 2, by considering only residue 
pairs separated by less than 10 A. As shown in Figure 
3A, the chemical nature of the interacting residues is a 
major determinant of their dynamical behavior. Unsur- 
prisingly, Glycine and Proline appear as the most effective 
ingredients of flexibility. Pairs involving hydrophobic and 
aromatic amino acids tend to be considerably more rigid, 
with 7(5) values up to 6 times larger. These differences 
originate in part in the individual propensities of different 
amino acids to be located in more or less flexible regions 



CO 
Q. 
Q. 

< 







\ A 


1 


: 


in 




-\ v^ 




- 


1 




-^^^=^ 


%. 


- 






■^ ^ 


— ' ' ^\\ 


- 






^ ~\_^y'" ^^^ " 


"^^--^^•■!^ 


- 








-\>/^L 




0.1 


— 






- 








- 






1 1 1 1 1 1 1 


1 —1 


Vl- 



5 10 20 

Interresidue distance d (A) 



50 



Figure 2: Comparison of the experimental and predicted val- 
ues of the apparent stiffness 7(d). Experimental values of 7(d) 
(continuous black) and 7°(d) (dashed black), extracted from 
the dataset of 1500 NMR ensembles. Values of 7(d) predicted 
on the same dataset by the ENMio (dashed red); ENM13 (con- 
tinuous red); ENM50 (dashed blue); ENM50 (continuous blue). 



(e.g. hydrophobic core vs. exposed surface loops). How- 
ever, there is only a limited agreement between 7(3) and 
7°(s) (Fig. 3A-B): the correlation coefficient is equal to 
0.71, and 7(5) spans a much wider range of values. Be- 
yond individual amino acid preferences, the specifics of 
residue-residue interactions play thus a significant role in 
determining the extent of cooperativity in residue motions. 

Accuracy of elastic network models in reproducing 
the dynamical properties of proteins 

The computation of the apparent stiffness of residue pairs 
in a mean protein environment provides an interesting tool 
to probe the dynamical properties of proteins. It also gen- 
erates a very straightforward approach to assess the ability 
of coarse-grained models to reproduce accurately this gen- 
eral behavior. 

We focus here on four common variants of the residue- 
based ENM 



25 26 , which differ only by the functional 



form of the spring constants k. The dependence of k on 
the interresidue distance Vpij is defined by two parame- 
ters: the cutoff distance Id, above which residues i and 
J are considered disconnected, and the exponent a that 
determines how fast k decreases with increasing distances: 



Kp,^iENMl) = apH{ld - rp,^Y 



(3) 



where H is the Heaviside function. The value of the 
temperature-related factor Op is obtained, for each pro- 
tein independently, by fitting the predicted MSRF with 
the experimental ones. This ensures that the amplitude 
of the individual fiuctuations of the beads in the network 
is on average equal to that observed in the corresponding 
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Figure 3: Comparison of the experimental and predicted values of the apparent stiffness 7(s), in the dataset of 1500 NMR 
ensembles. For each amino acid, the median value of 7(s) over the 20 possible partners is given in units of ksT, along with 
the maximal, minimal, f^* and S''* quartile values. Outliers from these distributions are depicted as circles. (A) Experimental 
values of 7(s). (B) Experimental values of 7°(s). (C) Values of 7(s) predicted by the ENMfo. 



NMR ensemble, and that the predicted 7(s, d) values can 
thus be directly compared with those extracted from the 
NMR data. We consider the following models: ENM°o, 
ENM'j'a, ENM^o, ENM^g. These ENM variants were used 

to estimate the value of a1 for each pair of residues in 

• pi] 

the 1500 proteins of our NMR dataset [Methods) , and to 
subsequently compute ^{d) and 7(5) from eq. 2. 

Strikingly, all ENM variants systematically predict ^{d) 
values to be lower than the experimental ones, at least up 
to interresidue distances of 20-30 A(Fig. 2). These models 
overestimate thus the amplitude of pairwise fluctuations, 
relatively to the amplitude of individual fluctuations. For 
example, if two residues in a protein undergo highly cor- 
related motions, the amount of thermal energy necessary 
to induce a moderate variance on the distance between 
them will generate high variances on their individual co- 
ordinates. Consequently, if the motions of the beads of 
the ENM are less coordinated, adjusting the scale of the 
spring constants to reproduce the amplitude of individual 
fluctuations leads to an overestimated variance on the in- 
terresidue distances, and thus to lower ^{d) values. This 
problem is particularly apparent when k is assumed to 
decrease proportionally to the square of the interresidue 
distance, in the ENM^q. Although this model was shown 
to perform well in predicting MSRF values ^8] , our results 
suggest that it negates almost completely the coordinated 
aspect of residue motions. Indeed, as shown in Figure 2, 
the "j{d) values predicted by this model are very close to 
those obtained from the experimental data after removal 
of the correlations between the motions of the different 
residues {^°{d)). This observation is consistent with the 
extremely short atom-atom correlation length character- 
istic of the ENMgQ, recently estimated on the basis of an 
X-ray structure of Staphylococcal nuclease t25| . 



The ENM is often considered as an entropic model, not 
detailed enough to include sequence information in a rele- 
vant way 5p8 . It is therefore hardly surprising that com- 
mon ENM variants produce a poor description of the se- 
quence speciflcities of protein dynamics. Individual amino 
acid preferences for more or less densely connected regions 
are responsible for some variety in the predicted values of 
7(s) (Fig. 3C). However, this variety is far from matching 
the one observed in the experimental data, as shown by 
a much narrower range of 7(5) values, and a limited cor- 
relation coefficient with the experimental 7(3) values, e.g. 
0.62 for the ENMfg (Supplementary Fig. 3). 



Derivation of effective harmonic potentials 

Mean-force statistical potentials are commonly used to 
perform energetic evaluations of static protein structures 
These potentials do not describe explicitly the 
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"true" physical interactions, but provide effective energies 
of interaction in a mean protein environment, in the con- 
text of a more or less simplified structural representation. 
Similarly, within the ENM framework, k(s, d) defines for 
each pair of residues an harmonic interaction potential. 
This potential is also effective in nature, accounting im- 
plicitly for everything that is not included in the model 
(e.g. the surrounding water). Hence, we seek to identify 
the value of k yielding the most accurate reproduction 
of the dynamical behavior of each type of pair (s, d) in a 
mean protein environment, which is conveniently captured 
by the apparent stiffness 7(5, d). 

For that purpose, let us define E (s, d) as the energy 
of the elastic spring connecting two residues of type (s, d), 
in a mean protein environment: 
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Figure 4: Effective liarmonic potentials. (A) Spring constants of the sENMio, for tlie 210 amino acid pairs. (B) Spring 



constants of tlie dENM. Tlie daslied line corresponds to k < 
All K values are given in Supplementary Tables 2-5. 



(C) Spring constants of the sdENM for 3 amino-acid pairs. 



E °" (s,d) = -K{s,d)al{s,d) 



knT 



k{s, d) 
7(s, d) 



(4) 



where 7(5, d) is the apparent stiffness extracted from the 

experimental data. E (s, d) is unknown and is expected 
to be different for different pair types {s,d). The knowl- 
edge of 7(s, d) is thus not sufficient to estimate directly 
k(s, d). However, from any approximate set of spring con- 
stants K'{s,d), we may build the ENM for all proteins in 
our dataset, to reproduce the mean protein environment, 
and compute for each pair type an estimated value of the 
apparent stiffness, j'{s, d), and bond energy, E (s, d). 

Since the behavior of a given residue pair is highly de- 
pendent on its environment, we can make the assump- 
tion that E (s, d) is a relatively good approximation 



oi E 



bond 



(s, d), even if k'(s, d) 7^ k(s, d): 



-=l bond , , 

E {s,d) 



knT 



k'{s, d) 



E'°"'(s,d) 



(5) 



l'{s,d) 

Indeed, if the spring stiffness of a residue pair is underes- 
timated (k' < k), it will also appear as less rigid in the 
ENM than in the experimental data (7' < 7). A more de- 
tailed discussion is given in Supplementary Note 1. From 
eqs. 4 and 5, we devise thus an iterative procedure in 
which k(s, d) is updated at each step k by confronting the 
predicted values of the apparent stiffness, ^^{s,d), with 
the experimental ones, ^{s,d). It is expected to converge 
when 7j,(s,d) — >■ 7(5, d), that is, when the predictions of 
the model agree with the experimental data: 



i^k+i{s,d) = Kk{s,d) 



7(s,d) 



(6) 



7fe(s,d) 

We used this approach to derive, from the NMR data, 
four novel ENM variants: the distance-dependent dENM ; 



the sequence-dependent sENMio and SENM13, with a dis- 
tance cutoff of 10 and 13 A, respectively, and the sequence- 
and distance-dependent sdENM (Methods). Interestingly, 
the K values for the 210 amino acid pairs in the sENMio are 
relatively well correlated with the corresponding contact 
potentials [28], even though they result from totally dif- 
ferent approaches (Supplementary Fig. 4). Some common 
general trends can be identified, e.g. hydrophobic contacts 
tend to be associated with both favorable interaction en- 
ergies and large k values (Fig. 4A). However, the overall 
correspondence remains limited, indicating that the deter- 
minants of protein rigidity and stability are related, but 
distinct. The distance dependence of k in the dENM is 
remarkably similar to the r~^ power law that was previ- 
ously obtained by fitting against a 1.5ns MD trajectory 
of a C-phycocyanin dimer [6] (Fig. 4B) , although our new 
model presents more detailed features. Notably, k remains 
approximately constant up to interresidue distances of 5-6 
A, and then drops by about two orders of magnitude to 
reach a second plateau between 7 and 12 A. The k values 
of the sdENM are shown in Figure 4C, for a few amino 
acid pairs. This model not only combines the strengths of 
the sENM and the dENM, but also reveals the sequence 
specificity of the k distance dependence. The D-R pair, 
for example, is almost as rigid as I-I at short distances 
consistent with the formation of a salt bridge, but almost 
as flexible as G-G at larger distances. 



Performances of the new ENM 

The sdENM yields a much more accurate reproduction of 
the dynamical behavior of residue pairs in a mean protein 
environment than the common ENM variants, as demon- 
strated by the good agreement between experimental and 
predicted values of 7(5) (Fig. 5A, Supplementary Fig. 5), 
and 7(d) (Fig. 5B). 
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Figure 5: Performances of the sdENM. (A) Experimental and predicted values of 7(s), in the dataset of 1500 NMR ensembles. 
See also Supplementary Figures 3 and 5. (B) Experimental (continuous) and predicted (dashed) values of 7(d), in the dataset of 
1500 NMR ensembles (black), and in a single protein (PDB Ixqq) (green). See also Supplementary Figure 2. (C) Comparison 
of the ability of the sdENM and the ENM50 to correctly reproduce the fluctuations of a single protein, on the basis of a high 
quality structural ensemble of human ubiquitin (PDB Ixqq) 30 . 20 randomly selected residue pairs are connected by solid 
lines. Shades of blue indicate a better performance of the sdENM, while shades of red indicate a better performance of the 
ENM50. See also Supplementary Figure 6. 



Beyond its performances in a mean protein environment, 
our new model also brings highly notable improvements 
vifith respect to previously described ENM variants when 
it is applied to the specific architecture of a given protein. 
This is illustrated by two examples, on Figure 5C and 
Supplementary Figure 6. A more thorough assessment of 
the ability of the different ENM variants to capture the 
motions of individual proteins was performed on an in- 
dependent dataset of 349 proteins. The correlation coeffi- 
cient between predicted and observed MSRF (r^) has been 
widely used in the past but ignores the cooperativity inher- 
ent to protein dynamics, and presents other shortcomings. 
Therefore, we introduce a new measure (e^) that quanti- 
fies the relative error on the estimation of the variability 
of the distance between residue pairs, and is thus focused 
on the cooperative aspects of residue motions (Methods) . 



Table 1: Performances of different ENM variants. '''•' Aver- 
age correlation coefficient between experimental and measured 
MSRF. ^''' Average relative error on the ffuctuations of inter- 
residue distances. 
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0.69 
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0.48 


0.56 


0.60 


sdENM 


0.70 


0.48 


0.41 


0.49 


0.57 



Among the 4 previously described ENM variants, 
ENM5Q is better at predicting the individual residue fluc- 
tuations (Table 1). Interestingly, the ENM°q, with its sim- 
ple cutoff distance, appears superior when it comes to the 
reproduction of cooperative motions [ea- = 0.59). The 
new ENM variants based on our effective harmonic poten- 
tials present enhanced performances in comparison with 
the common models. In particular, the dENM reaches the 
same level of quality as the ENMgg for individual fluctua- 
tions (rs — 0.69), but surpasses even the ENM'j'q for the 
description of cooperativity {e^ — 0.54). On the other 
hand, the impact of introducing sequence specificity can 
be examined by comparing sENMio/13 with ENM^q/is, 
and sdENM with dENM. It consists in a slight improve- 
ment of the correlation coefficient rg, and a pronounced 
decrease of the error eo-, especially at short- (0-15 A) and 
mid- (15-30 A) range. 



Discussion 

For the last decades, statistical potentials extracted from 
datasets of known protein structures [27 ■ 29 have played 
a critical role in static analyses of protein structures, with 
major applications including structure prediction, protein- 
protein docking, or rational mutant design. Our study 
demonstrates that a similar approach can be taken to de- 
rive effective energy functions that are specifically adapted 
to the coarse-grained modeling of protein dynamics. 

More precisely, in the context of the ENM, we exploited 
a dataset of 1500 NMR ensembles to determine the values 
of the spring constants that describe best the behavior of 
pairs of residues, as a function of both their chemical na- 



ture and the distance separating them. The success of our 
approach is attested by a drastic enhancement of the abil- 
ity to accurately describe the cooperative nature of residue 
motions, with respect to previously described ENM vari- 
ants. Moreover, a definite advantage of our method is that 
the effective parameters characterizing the strength of the 
virtual bonds are directly extracted from the experimen- 
tal data without any a priori conception of their functional 
form. The fact that the distance dependence of the spring 
constants that we retrieve is quite similar to the r~^ power 
law, which was considered so far as underlying one of the 
best performing ENM variants [6 25 , also constitutes a 
major support to our approach. 

In our derivation scheme, the virtual bonds are 
parametrized so as to reproduce the behavior of amino 
acid pairs in a mean protein environment. The analysis 
of the ability of different models of protein dynamics to 
describe the motions of residues within this environment 
sheds an interesting new light on the properties of these 
models. In particular, our results indicate that previous 
ENM variants underestimate, sometimes dramatically, the 
rigidity of amino acid pairs at short- and mid-range. Our 
new model does however provide a much more accurate 
reproduction of the balance between short-range and long- 
range coordinated motions. This is arguably a crucial as- 
pect when considering, for example, the consequences of 
localized alterations induced by ligand binding on signal 
transduction or global conformational changes, such as in 
ATP-powered molecular motors. 

Importantly, our results also demonstrate that the ENM 
does not have to be exclusively structural, and that se- 
quence details may be allowed to play a major role in 
coarse-grained descriptions of protein dynamics. Thereby, 
this study paves the way towards comparative analyses 
of motions in proteins that share a similar structure but 
present differences in sequence. Such investigations will 
prove particularly interesting in the context of the ratio- 
nal design of (modified) proteins with controlled dynami- 
cal properties. Although we focused here on residue-based 
elastic network models, our approach is not limited to this 
particular family, and can be readily implemented to eval- 
uate and optimize other coarse-grained models of protein 
dynamics. Notably, the impact of chemical specificity on 
the dynamical behavior of residues should be even more 
accurately rendered by effective potentials based on a more 
detailed structural description. 

Methods 

NMR Dataset 



We retrieved, from the Protein Data Bank 31 , ensembles 



that present at most 30% sequence identity with one an- 
other. Entries under the SCOP classifications "Peptides" 
or "Membrane and cell surface proteins" were not consid- 
ered. The presence of ligands, DNA or RNA molecules, 
chain breaks, non-natural amino acids, and differences in 
the number of residues per model were also grounds for re- 
jection. These criteria led to the selection of 1849 distinct 
structural ensembles. A subset of 1500 ensembles was ran- 
domly selected for the main analysis, and the remaining 
349 were used to assess the performances of the different 
ENM variants. Unfolded C- or N-terminal tails were au- 
tomatically identified (MSRF values larger than twice the 
average for all residues in the protein) and removed from 
consideration. In each ensemble, the structure with the 
lowest root mean square deviation from the mean struc- 
ture, after superposition, is chosen as representative and 
used to build the ENM. 

Elastic network model 

The network is built by considering each residue as a sin- 
gle bead, placed at the position of the corresponding Cq, 
atom in the input structure, and connecting neighboring 
beads with Hookean springs pUH] . The ENM variants con- 
sidered here differ only by the form of the spring constant 
K as a function of interresidue distance and of amino acid 
types. In all variants, bonded interactions are described 
by a larger value of k, defined as ten times the value of 
K for non-bonded interactions at a separation of 3.5 A, 
averaged over all amino acid types. The potential energy 
of the network is given by: U = Yji<ji'^ij/'^){^ij ~ ^tj)"^^ 
where Vij and r° are the instantaneous and equilibrium 
distances between residues i and j, respectively. By def- 
inition, the input structure corresponds to the global en- 
ergy minimum, with [/ = 0. For a protein of n residues, 
the Hessian H of the system is the in x 3?! matrix of the 
second derivatives of U with respect to the spatial coordi- 
nates of the residues. The eigenvalue decomposition of H 
yields the covariance matrix C of the spatial coordinates, 
which constitutes the output of the model: 



3n-6 



1 



k=l *- 



(7) 



where the sum is performed over the 3n — 6 non-zero eigen- 
values Afc of H, and u^ are the corresponding eigenvectors. 
C is a 3ri X 3?! symmetrical matrix, constituted of n x n 
submatrices C, 



Jij- 



^< AxiAxj > 

< AyiAxj > 

< AziAxj > 



< AxiAyj > 

< Ay, Ay J > 

< Az.Ayj > 



< AxiAzj >^ 

< AyiAzj > 

< Az.Az, > , 



of at least 20 models from solution NMR experiments, cor- 
responding to monomeric proteins of at least 50 residues 



(8) 

where Axi, Ay^, and Azi correspond to the displacements 
of residue i from its equilibrium position, along the three 



Cartesian coordinates. The predicted MSRF of residue i 



is given by the trace of submatrix C^, 
(AxO^ > + < (A2/,)2 > + < (Azi)2 >. 



< (Ai?, 



>=< 



Variance of the interresidue distance 

For each pair of residues in a given protein p, the experi- 
mental value of this variance is readily computed from the 
NMR data: 



M„ 



'■p.3 Mp ^—i ^ 



pijm 



' pijJ 



(9) 



where Mp is the number of structures in the NMR ensem- 
ble, Tpijm the distance between the Cq atoms of residues i 
and j in structure m of protein p, and fpij the average dis- 
tance over all Mp structures. In the context of the ENM, 
CT^ values are estimated from the covariance matrix of 
the spatial coordinates, by standard statistical propaga- 
tion of uncertainty: 






(10) 



where J is the Jacobian of the distance Vpij as a function 
of the six coordinates (xi,yi,Zi,Xj,yj,Zj). This estima- 
tion oi af. . . relies on the validity of the first order Taylor 
expansion of the distance as function of the coordinates 
in the vicinity of the average distance. We ensured that 
no systematic bias arose from this approximation (Supple- 
mentary Fig. 7). To quantify the impact of the individ- 
ual motions of residues on their relative positions, we use 
eq. 10 to compute (ct° .)^ in an artificial construct where 
residue motions are not correlated. This is achieved by 
extracting the covariance matrix from the NMR data, and 
setting to zero all submatrices Cij where i ^ j ■ 

Iterative procedure 

The values of the spring constants of the new ENM vari- 
ants were derived from the dataset of 1500 NMR ensem- 
bles using eq 6. For the dENM, sENMio and sENMia, the 
initial values of the spring constants were set equal to the 
experimental values of the apparent stiffness: Ko{d) — j{d) 
or Ko{s) = 7(s). Note that the 7(5) values were computed 
by considering only residue pairs separated by a distance 
lower than the cutoff of 10 or 13 A. For the sdENM, the 
K(){s,d) values were set equal to the final values of the 
spring constants in the dENM, K{d), for all amino acid 
types. A correction for sparse data was devised to ensure 
that k(s, d) tends to K{d) when the number of residue pairs 
of type (s, d) is too small to obtain relevant estimations of 
(T^(s,(i). Instead of eq. 2, we used the following definition 
to compute both the experimental and predicted apparent 
stiffness: 



7(s, d) 
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(11) 



where Ngd — J2p=i^pi^'d)^p^ ^p{s,d) is the number 
of pairs of type (s, d) in protein p, Mp is the number of 
structures in the NMR ensemble of protein p, and S is an 
adjustable parameter set to 500. 

The K values were rescaled after each iteration step, so 
that the average value of k over all amino acid types is 
equal to 1 for pairs separated by a distance of 6 A. Residue 
pairs of a given type (s, d) for which k{s, d) < 0.001 (after 
rescaling), were considered to establish no direct interac- 
tion: k{s, d) was set to 0, and they were no longer con- 
sidered in the iterative procedure. The performances of 
the new ENM variants after the first nine iteration steps 
are reported in Supplementary Table 1. The procedure 
converged rapidly for the dENM and the sdENM, and the 
final models were selected after 5 and 3 iteration steps, 
respectively. The sENM variants did not improve signifi- 
cantly with respect to the initial models (fc = 0), indicat- 
ing that k{s) = 7(5) is a good approximation, contrary to 
K{d) = 7((i). The procedure was thus stopped after one 
iteration step, for both the sENMio and the SENM13. 

Performance measures 

The ability of coarse-grained models to accurately describe 
protein dynamics is commonly evaluated by computing 
the Pearson correlation coefficient between predicted and 
experimental MSRF, < (Ai?^)^ >, over all i = l,...,n 
residues of a given protein: 
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(12) 



B) 



where, for simplicity, Bi was used instead of < (Ai?^)^ >. 
There is indeed a direct relationship between the MSRF 
and the cristallographic B-factors: Bi — (87r^/3) < 
(ARi)^ >. Bl^^ and B^'^ correspond thus here to the 
MSRF of residue i extracted from the NMR data and pre- 
dicted by the ENM, respectively. The scale of the pre- 
dicted MSRF values depends on the scale of the spring 
constants, which are only defined up to a constant fac- 
tor. This factor was determined, for each protein indepen- 
dently, by fitting the scales of the predicted and experi- 
mental MSRF, i.e. to ensure that: 



B 
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(13) 



Although it has been widely used in previous studies, tb 
is probably not the most adequate measure to evaluate 



the performances of coarse-grained models of protein dy- 
namics. As pointed out previously 24 25 , it does indeed 
present several shortcomings: e.g. it is strongly affected 
by the presence of highly flexible regions, and does not 
account for possible flaws leading to an intercept of the 
regression line different from zero. Most importantly, the 
MSRF describe individual fluctuations but provide no in- 
formation about the cooperative aspects of residue mo- 
tions. The quality of the MSRF predictions gives thus no 
guarantee about the ability of the model to describe the 
cooperativity of protein dynamics. The ENMgg provides 
an interesting example, for it performs quite well in pre- 
dicting the MSRF but basically negates all cooperativity 
(Fig. I Table [l]). 

Therefore, we introduce a new measure that exploits 
the information contained in the correlation matrix C, to 
quantify the error on the estimation of the fluctuations of 
the interresidue distances: 



^a — \ 




(cxp) 



,(prc) 



(14) 



where iVp is the number of non-bonded residue pairs in 
protein p, Or^^ and (Tr-p™ ^^^ ^'^ experimental (eq. 9) 
and predicted (eq. 10) values of 0^^^^ , respectively. (7rp[j 
is obtained after fitting the experimental MSRF with the 
predicted ones (eq. 13). The error is normalized by (t° .., 
which is the expected value of cr^ ; given the individual, 
anisotropic, fluctuations of both residues extracted from 
the NMR data, but neglecting all correlations between 
their respective motions. This normalization ensures that 
the contributions of the different pairs of residues are 
equivalent, and that the measure is not dominated by 
highly flexible regions. 

Both rs and eo- are computed independently for each of 
the 349 proteins of our test set, and the average values are 
reported. We also report the short- (e^^), mid- (e^^), and 
long-range (ej;^) contributions to e^, obtained by consid- 
ering only pairs separated by 0-15 A, 15-30 A, and more 
than 30 A, respectively. 
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